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MASSIVELY PARALLEL ALGORITHMS FOR THE LATTICE 
BOLTZMANN METHOD ON NON-UNIFORM GRIDS 

FLORIAN SCHORNBAUMt AND ULRICH RUDEt 


Abstract. The lattice Boltzmann method exhibits excellent scalability on current supercomput¬ 
ing systems and has thus increasingly become an alternative method for large-scale non-stationary 
flow simulations, reaching up to a trillion (10^^) grid nodes. Additionally, grid refinement can lead 
to substantial savings in memory and compute time. These saving, however, come at the cost of 
much more complex data structures and algorithms. In particular, the interface between subdo¬ 
mains with different grid sizes must receive special treatment. In this article, we present parallel 
algorithms, distributed data structures, and communication routines that are implemented in the 
software framework waLBerla in order to support large-scale, massively parallel lattice Boltzmann- 
based simulations on non-uniform grids. Additionally, we evaluate the performance of our approach 
on two current petascale supercomputers. On an IBM Blue Gene/Q system, the largest weak scaling 
benchmarks with refined grids are executed with almost two million threads, demonstrating not only 
near-perfect scalability but also an absolute performance of close to a trillion lattice Boltzmann cell 
updates per second. On an Intel-based system, the strong scaling of a simulation with refined grids 
and a total of more than 8.5 million cells is demonstrated to reach a performance of less than one 
millisecond per time step. This enables simulations with complex, non-uniform grids and four million 
time steps per hour compute time. 

Key words, lattice Boltzmann method, grid refinement, non-uniform grids, supercomputing, 
scalable parallel algorithms, parallel performance, LBM, HPC, CFD 
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1. Introduction. In the last decade, the lattice Boltzmann method (LBM) has 
gained popularity as an alternative to classical Navier-Stokes solvers for computational 
fluid dynamics (CFD) [1, 14]. For simulations with the LBM, the simulation domain 
is typically discretized with a uniform Cartesian grid. If the resolution of a three- 
dimensional simulation must be increased in space and time, then the total number 
of cells and the computational cost increase quickly. 

Many of the existing frameworks for the LBM [15, 18, 26, 30, 33, 39, 45, 49] are 
therefore designed for parallel computers where many show excellent scalability, i.e., 
their performance increases linearly with the number of processors employed. Going 
beyond just scalability, a carefully crafted architecture-aware implementation of the 
LBM as realized in the waLBerla framework [24] can achieve excellent absolute 
performance and thus reduce the time to solution and the power consumption to 
reach a given computational objective. Using globally uniform Cartesian grids, i.e., 
using the same grid resolution for the entire simulation, it is possible to discretize flow 
geometries with in excess of 10^^ lattice cells on current petascale supercomputers [24]. 

However, for certain simulations, only parts of the entire domain require high 
resolution. In order to focus the computational resources on those regions that need 
high accuracy, many advanced methods in CFD rely on grid refinement. For in¬ 
corporating grid refinement into the LBM, node-based [20, 34, 37, 46, 52, 53] and 
volume-based [13, 27, 30, 44, 50] approaches have been proposed. 

In node-based approaches, the distribution functions of the LBM are located at 
the nodes of the lattice cells. At the interface between two different grid resolutions, 
fine grid nodes either coincide with or are located exactly halfway between coarse 
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nodes. For the LBM scheme of [52], a parallel extension was proposed in [21]. In node- 
based approaches, the non-equilibrium part of the distribution functions typically 
must be rescaled at the interface between different grid resolutions. 

In volume-based approaches, the distribution functions of the LBM are located 
at the center of lattice cells. During refinement, coarse cells are uniformly subdivided 
into multiple hner cell. As a result, the centers of hne and coarse cells will not coin¬ 
cide. Volume-based grid refinement approaches for the LBM allow formulations that 
guarantee the conservation of mass and momentum without the need to rescale the 
non-equilibrium part of the distribution functions. For volume-based grid refinement, 
[44] and [30] presented parallelization approaches that are suitable for large-scale sys¬ 
tems. In order to achieve scalability to these systems, both approaches are based on a 
tree partitioning of the simulation space. [30] introduces the MUSUBI software that 
relies on a linearized octree, [44] uses the Peano framework [10] that is based on a 
more generalized spacetree concept. An octree-based domain partitioning approach 
was also proposed by [19], however, not for incorporating grid refinement but for 
better fitting block-based data structures to complex geometries. To our best knowl¬ 
edge, other popular simulation codes based on the LBM such as Palabos [37, 45], 
OpenLB [32, 33, 36], LB3D [25, 39], HemeLB [26], HARVEY [49], or LUDWIG [15] 
are at a state that they either do not support grid refinement, only support grid re¬ 
finement for 2D problems, or have not yet demonstrated large-scale simulations on 
non-uniform grids. 

In this article, we present a volume-based refinement approach that consists of a 
distributed memory parallelization of the algorithm described in [50] combined with 
the interpolation scheme suggested by [13]. The main goal of the implementation into 
the software framework waLBerla is to ensure applicability for current petascale 
and future exascale supercomputers. To this end, all algorithms and data structures 
are carefully designed with high node level efficiency as well as scalability to mas¬ 
sively parallel systems in mind. We demonstrate simulations on non-uniform grids 
with multiple different grid resolutions and close to one trillion (10^^) cells that run 
on current petascale systems and make use of up to nearly two million concurrent 
threads. We also show that our implementation can reach throughput rates of more 
than 1,000 time steps per second in a strong scaling scenario, enabling simulations 
that perform several million time steps per hour compute time. To the best knowl¬ 
edge of the authors, the total number of cells that can be handled as well as the 
overall performance achieved significantly exceed the data published for the LBM on 
non-uniform grids to date [21, 30, 40, 51]. In addition to the algorithms and data 
structures, we also propose a method for scaling the two relaxation parameters of the 
two-relaxation-time collision model across different grid resolutions. 

The remainder of this article is organized as follows. In section 2, we give a brief 
introduction to the LBM including details about all the necessary extensions that are 
required for incorporating grid refinement. In section 3, we describe new data struc¬ 
tures and concepts that we added to the waLBerla framework in order to extend the 
framework to support efficient, massively parallel simulations on non-uniform grids. 
Details on the implementation of our parallel algorithm for the LBM, communica¬ 
tion, and load balancing follow in section 4. In section 5, we verify the correctness of 
our parallel scheme for two different Poiseuille flow scenarios and present weak and 
strong scaling benchmarks that demonstrate the performance of our approach on two 
petascale supercomputers. We conclude the article in section 6. 
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D3Q19 



D2Q9 


Fig. 1. Schematic visualization of one lattice cell for the three-dimensional D3Q19 and the two- 
dimensional D2Q9 model. For D3Q19, no data is exchanged across cell comers, but only across 
edges and faces. While all simulations of this article use the D3Q19 model, all later illustrations 
are in 2D in order to facilitate the schematic visualization of the methods. 


2. The lattice Boltzmann method. The LBM uses an explicit time stepping 
scheme that is well suited for extensive parallelization due to its data locality. In each 
time step, information is only exchanged between neighboring grid cells. Each cell 
stores N particle distribution functions (PDFs) 

/„ :DxT^ [0,1], a = 0,...,iV-l, 

where D C and T = [0, tend] are the physical and time domain, respectively. In this 
article, we employ the three-dimensional D3Q19 model as it was originally developed 
by Qian, d‘Humieres, and Lallemand [48] with TV = 19 PDFs stored in each grid cell 
(see Figure 1). In general, the lattice Boltzmann equation can then be written as 

/„(Xi -I- eaAt,t + At) - fa{y^i,t) = Ca(f), 

with Xi denoting the center of the i-th cell in the discretized simulation domain, ea 
denoting the discrete velocity set {e^la = 0,..., TV — 1}, t denoting the current time 
step. At denoting the time step size, and Ca{i) denoting the collision operator of the 
LBM. Algorithmically, the LB equation is typically separated into a collision (2.1a) 
and a streaming step (2.1b) 

(2.1a) fa{Xt,t) = fa{Xt,t) +Ca{i), 

(2.1b) fa{xi + eaAt,t + At) = fa{:iit,t), 

with fa denoting the post-collision state of the distribution function. During stream¬ 
ing, PDFs are exchanged only between neighboring cells while the collision step is 
a local operation in each cell. The three most commonly used collision schemes are 
the single-relaxation-time (SRT/LBGK) model [8], the two-relaxation-time (TRT) 
model [22, 23], and the multiple-relaxation-time (MRT) model [16, 17]. In this ar¬ 
ticle, we focus on the SRT and the TRT model. For the SRT model, the collision 
operator is given by 

Ca(f) = - —(/a(Xi,t) - fff{Xi,t)) = -a;(/a(x,,t) - fff{x,,t)), 

T 

where r is the relaxation time, w €]0, 2[ the dimensionless relaxation parameter, and 
fff {xi , t) is the equilibrium distribution. The relaxation time r and the relaxation 
parameter w are related to the kinematic viscosity v with 
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where c = Ax/At is the lattice velocity and Cs = c/\/3 is the speed of sound for an 
isothermal fluid [1, 14]. For the incompressible LBM [31], the equilibrium distribution 
function can be calculated according to 
(2.3) 


= Wa 


+ po 


3ea-u(xi,t) 9(e„ • u(xi,t))2 3u{x„t) 


2 \ 1 


2c4 


2c2 


with Wa denoting the lattice model-specific weighting factor corresponding to and 
Po denoting the reference density. p{:x.i,t) and u(a;j,t) are the fluid density and the 
fluid velocity, respectively. They are calculated from the first two moments 

(2.4) p(xi,<) = V/®«(xi,t) and u(x,,t) = — Vea/®'J(x,,t). 


For the TRT model, the distribution functions are split into a symmetric (even) and 
an asymmetric (odd) part 

ft = ^(/« ± /a) ^ = lifft ± fit), 

with a denoting the inverse direction of a. The corresponding collision operator is 
given by 

Ca{f) = -Xeift - f!^^) - Ao(/- - Fan, 

with Ae S]0, 2[ and Ao SjO, 2[ denoting the even and the odd relaxation parameter of 
the TRT model. If Ae = Ao = w = At/r, the TRT model coincides with the SRT 
model. The kinematic viscosity is related to Ag just like it is related to uj when using 
the SRT model. Hence, for the TRT model, 

(2.5) u = cl At. 

Typically, implementations of the LBM are formulated in a dimensionless lattice space. 
In order to set up simulations and evaluate the results, physical quantities must then 
be transformed from physical space into dimensionless lattice units, and vice versa [38]. 

2.1. External forces. External forces can be incorporated into the LBM by 
including an additional term in the collision step (2.1a) 

fa (Xi ,t) = fa (Xi ,t)+Cait)+ AtFa . 


There exist various different force models for the LBM [43]. Depending on the force 
model, the calculation of the fluid velocity (2.4) and/or the velocity terms in (2.3) 
must be adapted. In order to incorporate a constant body force F caused by a globally 
constant acceleration a into the simulation, we use 

Cq ■ F Gq, ■ a 

Fa = Wa - 5 — = WaPO - 5 —• 

cj ci 

For the evaluation of the equilibrium distribution (2.3) during the collision step, we 
do not change the calculation of u(xi,t), meaning we use an unmodified version 
of (2.4). However, when calculating the fluid velocity independently of the equilibrium 
distribution, we add an additional term to (2.4) [23] 

u(xi,t) = —'^eafa\Xi,t) + 

PO ^ 2po 
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Fig. 2. Boundary cells are marked in gray (1). After the collision step and prior to streaming, 
the boundary treatment algorithm stores PDF values inside the boundary cells itself (2). During 
streaming, no special treatment has to be performed for cells next to the boundary: All fluid cells 
can pull PDF values from all of their neighbors (3), even if the neighboring cell is a boundary cell. 


2.2. Boundary treatment. Domain boundaries and obstacles within the fluid 
must receive special treatment. We mark every cell of the grid as either a fluid cell, a 
boundary/obstacle cell, or a cell that is outside of the simulation domain and hence 
does not need to be considered during the simulation at all. Depending on which 
boundary marker is set for a certain cell, a different boundary treatment is performed 
during a post-collision/pre-streaming step. After the collision step, PDF values are 
calculated depending on the corresponding boundary conditions and are saved inside 
the boundary cells. During the subsequent streaming step, these values are pulled by 
their neighboring fluid cells (see Figure 2). 

As a result of this approach, the algorithms performing the collision and the 
streaming step can remain unchanged since during collision and streaming, no special 
operations are necessary near boundary cells. Because of boundary treatment being 
a pre-streaming operation, even existing fused stream-collide compute kernels (see 
section 4.1) do not require any changes. Periodic boundaries are treated by copying 
PDF values from one side of the simulation domain to the other, and vice versa. For 
parallel simulations, the communication step that normally synchronizes neighboring 
subdomains that reside on different processes can also be used to process periodic 
boundaries by establishing a communication channel between opposing sides of the 
simulation domain. 


2.3. Grid refinement. In this article, we present an efficient, highly optimized 
parallelization of the algorithm proposed by [50] for the LBM on non-uniform grids. 
The algorithm presented in [50] relies on a 2:1 balance between neighboring grid cells 
at the interface between two grid levels 


Axl-hi = 


Axl 

2 


Axl = 


Axo 
2 ^ ’ 


with L denoting the grid level and L = 0 corresponding to the coarsest level. The 
jump in L between a cell and all of its neighboring cells is at most 1. A non-uniform 
grid conforming with these features is illustrated in Figure 3. We apply acoustic 
scaling where the time step ratio is proportional to the grid spacing ratio [29, 51]. 
Thus, due to the grid spacing ratio of 2:1, twice as many time steps must be executed 
on level L -|- 1 compared to level L. Hence, Atp = Ato/2^. As a result, the speed 
of sound Cs = cj^fZ with c = AxjAt remains constant on each grid level. The 
kinematic viscosity u (2.2) must also remain constant. In order to keep v constant. 
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the dimensionless relaxation parameter w must be properly scaled to each grid level 

^ _ 2uJo 

2i+i + (l-2iVo’ 

or more generally: 

Since for the TRT model, the kinematic viscosity is related to Ag just like it is related 
to uj when using the SRT model (2.5), we use the same equations (2.6a) and (2.6b) 
for scaling Ae to different grid levels. In order to choose Ao, we use 



(2.6a) 

(2.6b) 


with the “magic” parameter Ago suggested in [23]. We propose to keep Ago constant 
on all levels and scale Ao on each level accordingly 


( 2 . 8 ) 


4 - 2Ag,L 

2 + (4Ago — l)Ag_L 


When applying a force caused by a constant acceleration a, we use the parametrization 
of the acceleration in order to calculate level-dependent lattice acceleration (in 
dimensionless lattice units [38]) 


(2.9) 


ar = a 


(^ 

Axl 


i*]^ = or more generally: 


* _ nK-L * 


3. Parallelization concept and realization. In the following subsections, we 
give a brief introduction to the WALBerla framework, present our domain partition¬ 
ing concept based on a 2:1 balanced forest of octrees space decomposition, introduce a 
corresponding two-stage initialization approach that proves to be useful for scalable, 
massively parallel simulations, and describe the new communication patterns that are 
required in order to support non-uniform grid structures. 

3.1. The waLBerla simulation framework. At its core, waLBerla is a 
general-purpose HPC software framework that is capable of supporting different nu¬ 
merical methods by providing generic, extensible concepts for domain partitioning, 
input and output, parallelization, and program flow control. The main focus of the 
framework is on massively parallel CFD simulations based on the lattice Boltzmann 
method [2, 3, 5, 9, 24, 47]. It is written in C-I--I-, supports all major compilers, and can 
scale from laptops and conventional desktop computers up to the largest supercom¬ 
puters with millions of concurrent threads. Besides parallelization with only OpenMP 
for shared memory systems or only with MPI for distributed memory, the framework 
also supports hybrid parallel execution where multiple OpenMP threads are executed 
per MPI process. Fine-tuned, vectorized compute kernels combined with a perfor¬ 
mance engineering approach that takes into account the specifics of the underlying 
architecture guarantee high performance [6, 24]. 

3.2. Forest of octrees domain partitioning. In waLBerla, the simulation 
space is decomposed into blocks [18]. These blocks can be assigned arbitrary data. 
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Fig. 3. Without refinement (1), the simulation space is uniformly decomposed into blocks. In 
order to support refinement (2), blocks can be further subdivided into equally sized smaller blocks: 
2 X 2 = A in this 2D illustration and 2x2x2 = S in our 3D simulations. In (1) as well as in (2), 
each block is assigned a grid of the same size (last picture of each row). As a result, transitions 
between different grid resolutions only occur on the boundary of blocks. For our massively parallel 
algorithm for the LBM on nun-uniform grids, exactly this kind of grid allocation strategy is used. 
However, the core data structures of waLBerla also support arbitrarily sized grids for each block, 
enabling different grid structures for other kinds of simulations. 


meaning each block can store an arbitrary number of different C-|—h data structures. 
These can be classes provided by the framework, other user-defined classes, STL 
containers, etc. In a parallel environment, one block is always assigned to exactly one 
process. By assigning a block to a specific process, this process becomes responsible 
for the part of the simulation space that corresponds to this block. One block cannot 
be assigned to multiple processes, but multiple blocks can be assigned to the same 
process. As a consequence, these blocks are the smallest entity that can be used for 
workload distribution, meaning load balancing is achieved by distributing all blocks 
to all available processes. For simulations without grid refinement, the simulation 
domain is uniformly decomposed into blocks and each block stores a Cartesian grid 
that corresponds to its part of the simulation domain (see Figure 3). Typically, 
exactly one block is assigned to each process and ghost layers of the grid on each 
block are used during the communication in order to synchronize neighboring blocks 
in parallel simulations (see section 3.4). Relying on a block-based rather than a cell- 
based domain partitioning is a parallelization approach that is also adopted by other 
software frameworks for the LBM [19, 36]. 

In order to achieve our goal of high node level efficiency and scalability to mas¬ 
sively parallel systems combined with support for simulations on non-uniform grids, 
the domain partitioning concepts described in [18] must be extended significantly. A 
fully distributed data structure is required for scalability to hundreds of thousands of 
processes and beyond. An essential property of such a fully distributed data structure 
is independence of the per-process memory requirements for managing the data struc¬ 
ture itself from the total number of processes. If two processes are assigned the same 
number of blocks with the same amount of data that is associated with each block, 
both processes must use the same amount of memory during simulation, regardless 
of whether the simulation runs with hundreds, thousands, or millions of processes. 

For simulations that rely on grid refinement, we implemented a distributed forest 
of octrees data structure that strictly follows these design principles and results from 
further subdividing the blocks of the initial Cartesian domain partitioning into eight 
equally sized smaller blocks (see Figure 3). Using this data structure, the simulation 
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Fig. 4. Schematic 2D illustration of the domain partitioning introduced in Figure 3(2). The 
largest blocks, i.e., the blocks of the initial Cartesian decomposition, act as roots for individual oc¬ 
trees. As such, they create a forest of octrees visualized on the left (note that quadtrees in this 
2D illustration correspond to octrees in 3D). During simulation, this forest of octrees is not stored 
explicitly, but it is defined implicitly by assigning each block a unique block ID that allows to recon¬ 
struct the block’s exact location within the forest of octrees. Additionally, every block stores a pair 
of type (block ID, process rank) for every neighbor, thereby creating a distributed adjacency graph 
for all blocks (illustrated on the right). 


domain is still partitioned into blocks. However, blocks now correspond to leaves of 
the forest of octrees (see Figure 4). Asa result, this new domain partitioning structure 
supports the re-use of existing algorithms. The developer must usually only define 
algorithms that work on blocks, regardless of whether the domain is uniformly parti¬ 
tioned into blocks or a more complex octree partitioning is used. Only few algorithms 
that are closely related to how the blocks are arranged in space, as for instance parts 
of the communication layer (see section 3.4), require a second implementation tailored 
to the specifics of the forest of octrees data structure. If we enforce 2:1 balanced oc¬ 
trees, the resulting block structure perfectly mirrors the 2:1 balanced grid structure 
required for the refinement scheme for the LBM described in section 2.3 and as such 
proves to be a suitable domain partitioning scheme. At this point we would like to 
remark that the idea of first partitioning the simulation domain into blocks using an 
octree approach and later creating Cartesian meshes inside these blocks was recently 
also adopted by a new software project: ForestClaw [11]. ForestClaw uses p4est [12] 
for domain partitioning, a library that already demonstrated scalability to massively 
parallel systems and, being based on a distributed forest of octrees implementation, 
shares similarities with our approach. 

An important, unique feature of our implementation is the fact that every block 
is aware of all of its spatially adjacent neighbor blocks, effectively creating a dis¬ 
tributed adjacency graph for all blocks (see Figure 4). As a result, we can not only 
execute algorithms typically associated with octrees but also algorithms that operate 
on more general graphs. The spatial partitioning of the simulation domain, however, 
geometrically always corresponds to a forest of octrees. Parent nodes or parent-child 
relationships do not need to be saved explicitly, the tree structure is implicitly defined 
by block IDs. Block IDs are used to uniquely identify blocks within the distributed 
data structure. They allow to reconstruct a block’s exact location within the forest of 
octrees. During the simulation, each process only knows about blocks that are stored 
locally and blocks that are neighboring these process-local blocks. For all non-local 
neighbor blocks, this knowledge is only comprised of a globally unique block ID and 
a corresponding process rank. Generally, processes have no knowledge about blocks 
that are not located in their immediate process neighborhood. As a consequence, the 
memory usage of a particular process only depends on the number of blocks assigned 
to this process, and not on the global number of blocks that are available for the 
entire simulation. This kind of fully distributed data structure enables scalability of 
simulations working on non-uniform grids to massively parallel systems (see section 5). 
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Fig. 5. Initialization starts with a uniform block partitioning (2). Blocks not needed for the 
simulation can be discarded (3). The load balancing step then assigns blocks to processes (4)- If 
refinement is required, in-between step (3) and (4), some - or all - of the initial blocks are further 
subdivided into finer blocks (see Figure 3). The resulting domain partitioning can be stored to file in 
order to be used as the starting point of the actual simulation (5). If the initialization runs during 
the actual simulation, step (5) will follow immediately after step (4) without the intermediate storage 
to file. In (5), the grid used for the computation is allocated locally on each process. 


3.3. Two-stage initialization. During the initialization, the construction of 
the data structure starts with a uniform block grid (see Figure 5). Each of these 
initial blocks acts as the root of an octree (see Figure 4). During an iterative process, 
blocks are then uniformly divided into eight smaller blocks as long as they are marked 
for refinement by user-defined callback functions. Our algorithms ensure a 2:1 bal¬ 
ance between neighboring blocks by also subdividing blocks that have not explicitly 
been marked for refinement by the user or the application. If no such callback func¬ 
tions are registered, the initial block grid remains unchanged and can act as a basis 
for simulations that only need a uniform block partitioning of the simulation space. 
Blocks located in parts of the initial space not needed during the simulation can be 
completely removed from the data structure, allowing the block partitioning to handle 
any kind of geometry and to adapt to arbitrarily shaped domains (see Figure 5). 

During initialization, no actual data is allocated, only two values are stored for 
each block: memory requirement and workload. These values are set by a user-defined, 
application-dependent callback function and are then used in the subsequent load 
balancing step. For load balancing, different strategies based on space filling curves 
and on the graph partitioning library METIS [35] are available. The weight of each 
block corresponds to the assigned workload. The task of the load balancing algorithm 
is to assign a target process ID to each block such that all blocks are distributed 
equally among all processes with respect to their weight. For the balancing strategy 
that is based on space Filing curves, we first create a linearized array that contains 
all blocks by traversing all trees in the forest of octrees and then we divide this array 
into contiguous pieces of equal weight [12]. When using METIS, we transform the 
adjacency graph containing all blocks (see Figure 4) to the graph data structure used 
by METIS. Via user-defined callback functions we allow to also specify weights for all 
block-block connections (= edges in the adjacency graph). Typically, these weights 
will correspond to the expected communication volume. This kind of graph-based 
load balancing on a block level was also proposed by [19]. For more details on our 
load balancing process in combination with LBM-based simulations see section 4.5. 
The end of the load balancing step also marks the end of the initialization phase. 
Not needing the entire, fully resolved grid during load balancing is a major advantage 
and enables very large domains for current, massively parallel supercomputers. This 
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hierarchical approach of first dividing the simulation domain into blocks and later 
filling these blocks with corresponding parts of the global grid allows us to handle 
computational grids that consist of trillions of cells [24]. 

The initialization phase can be part of the actual simulation or it can be run 
prior to the simulation on a completely different machine with a different number of 
processes (see Figure 5). On a typical desktop computer, the entire initialization phase 
takes a few milliseconds for several thousands of blocks and, depending on the chosen 
load balancing algorithm, several seconds for millions of blocks which are needed for 
massively parallel simulations. Running the initialization of the data structure prior 
to the actual simulation allows to fine-tune the domain partitioning by trying different 
load balancing strategies and varying corresponding configuration parameters. Once 
this results in a satisfying domain partitioning, the corresponding forest of octrees 
data structure can be saved to file. Running the actual simulation then starts by 
one process reading the file and broadcasting the binary file content to all the other 
processes. On a current IBM Blue Gene/Q supercomputer, this process of reading 
the file, broadcasting its content, and setting up the forest of octrees data structure 
on each process according to the information in the file only takes a few seconds 
for a simulation involving millions of blocks. In this particular case, the size of the 
corresponding file storing the forest of octrees structure is typically in the range of 100 
to 200 MiB. As a result of this two-stage approach, the actual simulation can start 
almost immediately and as such contrasts favorably with alternative initialization 
strategies that, at least initially, require the entire, fully resolved grid for partitioning 
and therefore can be much more time and memory consuming. 

3.4. Inter-block communication. In a parallel environment, the waLBerla 
framework takes care of the necessary communication between the blocks of the do¬ 
main partitioning by providing an extensible, feature rich communication layer that 
can be adapted to the needs and communication patterns of the underlying simula¬ 
tion. Generally, the communication is organized into two steps: packing data into and 
unpacking data from buffers which are exchanged between processes. If two blocks 
reside on the same process, the communication layer allows data to be copied directly 
without a call to MPI and without the intermediate exchange of a buffer. For the 
plain LBM, communication is only performed between spatially adjacent blocks. For 
the rest of this article, we refer to two spatially directly adjacent blocks as being con¬ 
nected via a corner if both blocks only intersect in one isolated point. If both blocks 
intersect in a line, we refer to them as being connected via an edge. Analogously, if 
they intersect in a surface, we refer to them as being connected via a face. 

For parallel simulations with the LBM, the grid assigned to each block is extended 
by at least one additional ghost layer that is used to synchronize the data. During 
communication, PDF values stored in the outermost inner cells are copied to the 
corresponding ghost layers of neighboring blocks (see Figure 6). 

Depending on the lattice model, exchanging PDF values with fewer neighbors 
may be sufficient. If D3QI9 is used, typically, PDF values must be exchanged only 
with neighboring blocks connected via a face or an edge, but not across a corner. 
Additionally, in most simulations, not all PDF values must be communicated, but 
only those streaming into the neighboring block (see Figure 6). For D3Q19, out of 
the 19 PDF values stored in each cell, five PDF values must be exchanged for every cell 
on a face-to-face connection and only one PDF value for every cell on an edge-to-edge 
connection. On average, this reduces the amount of data that must be communicated 
by a factor of 4. It should be noted that the modular design of the communication 
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Fig. 6. 2D example of the LBM with lattice model D2Q9: Every block contains a grid of size 
2x2 with one additional ghost layer. On edges, three PDF values must be exchanged. On corners, 
only one PDF value is communicated. 



Fig. 7. For the forest of octrees data structure, three different communication patterns exist: 
data exchange between blocks of the same size (1), fine-to-coarse communication (data packed on 
small and unpacked on larger block) (2), and coarse-to-fine communication (data packed on large 
and unpacked on smaller block) (3). 


layer allows to adapt all communication steps to the specific needs of the application. 
If an application requires more PDF data to be present in ghost layers, always also 
more data can be exchanged between processes. 

For the forest of octrees data structure, we had to implement a new communi¬ 
cation scheme that consists of three different communication patterns (see Figure 7): 
(i) data exchange between two blocks of the same size on the same level, (ii) send¬ 
ing data to a larger block on a coarser level (fine-to-coarse communication), and (iii) 
sending data to a smaller block on a finer level (coarse-to-fine communication). De¬ 
tails on the implementation of these different communication patterns for the LBM 
on nun-uniform grids follow in sections 4.2, 4.3, and 4.4. 

With and without refinement, always only one message is exchanged between two 
processes. This message may contain the data of multiple blocks. When operating in 
hybrid OpenMP and MPI mode, these messages can be sent and received in parallel 
by different threads. Packing data into buffers prior to sending and unpacking data 
after receiving can also be done in parallel by different threads if OpenMP is activated. 
The OpenMP implementation of the communication module follows a task parallelism 
model. First, a unique work package is created for every face, every edge, and every 
corner of every block. These work packages are then processed in parallel by all 
available threads. As a result, as long as all compute kernels are also parallelized 
using OpenMP (following a data parallelism model, see section 4), all major parts of 
the simulation are executed thread-parallel. 
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Fig. 8. The series of images above the dashed line illustrates one time step on the coarse grid 
and two corresponding time steps on the fine grid. The images below the dashed line show the exact 
same process, but here we illustrate the data distribution of our parallel implementation. Since grid 
level transitions coincide with block boundaries, we also visualize cells in ghost layers involved in 
the algorithm. Cells that change their content from one step to the next are marked with a gray, 
surrounding box. Once step (7) is reached, the state of the algorithm is again identical to step (1). 


4. The LBM on non-uniform grids. In our parallel implementation, grid 
refinement for the LBM is based on the forest of octrees block partitioning of the sim¬ 
ulation space described in the previous section. Each block is assigned a grid with the 
same number of cells in x-, y-, and z-direction (cf. Figure 3). As a result, the interfaces 
between subdomains of different resolution coincide with the block boundaries. 

Figure 8 schematically illustrates the algorithm proposed by [50] and our parallel 
implementation. Initially (I), the simulation is in a post-streaming state. Our parallel 
algorithm starts with performing a collision in coarse and fine cells (2). Cells in 
ghost layers are not included in the collision. The content of two coarse cells is then 
transferred to four ghost layers in the fine grid as indicated by the arrow between step 
(2) and (3). Next, streaming is performed in both grids, including the two innermost 
ghost layers of the fine grid (4). In step (5) and (6), another collision and streaming 
step is performed in the fine grid only. Streaming again includes the two innermost 
ghost layers. Finally, PDF values in these two innermost ghost layers are merged and 
transferred to the coarse grid as indicated by the arrow between step (6) and (7). 
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Algorithm 1: NonUniformTimeStep 


1 Function NonUniformTimeStepC^eiie/ L) 

2 forall the blocks on level L do CollisionStep 

3 if L 7 ^ finest level then 

4 I recursively call NonUniformTimeStepCL + 1) 

5 end if 

6 if L 7 ^ coarsest level then 

7 call Explosion (L, L — 1) 


8 

9 

10 

11 

12 


end if 

call Commuiiication (L) 

forall the blocks on level L do StreaminffStep 
if L 7 ^ finest level then 

call Coalescence (L, L + 1) 


13 

14 

15 

16 

17 

18 

19 

20 
21 
22 

23 

24 


end if 

if L = coarsest level then 
I return 
end if 

forall the blocks on level L do CollisionStep 
if L 7 ^ finest level then 
I recursively call NonUniformTimeStepd/ + 1) 
end if 

call Communication (L) 

forall the blocks on level L do StreamingStep 
if L 7 ^ finest level then 

call Coalescence (L, L + 1) 


25 


end if 


/* LBM collision step */ 
/* recursive call */ 

/* coarse-tO“fine communication */ 
/* initiated by fine level */ 

/* equal-level communication */ 
/* LBM streaming step */ 

/* fine-to-coarse communication */ 
/* initiated by coarse level */ 

/* end recursion */ 

/* LBM collision step */ 

/* recursive call */ 

/* equal-level communication */ 
/* LBM streaming step */ 

/* fine-to-coarse communication */ 
/* initiated by coarse level */ 


26 end 


The implementation and the methods used for transferring data between different 
grid levels are covered in detail in the following subsections. Note that from (6) to 
(7) not all PDF values are transferred from fine to coarse, but only those streaming 
into the coarse grid (marked red in the illustration). Also note that the ghost layer 
of the coarse grid is not required by this algorithm for fine-to-coarse or coarse-to- 
fine communication. As a consequence, four ghost layers are only needed for blocks 
with coarser neighbors. Also, four layers of cells are never communicated. At most, 
two layers are transferred during coarse-to-fine communication (see section 4.3). For 
fine-to-coarse communication and communication between two equally sized blocks, 
significantly less data must be transferred (see sections 4.4 and 4.2). 

4.1. Implementation. Algorithm 1 displays the basic structure of the recursive 
procedure that represents the program flow control for our approach of the LBM on 
non-uniform grids. To improve readability, functions that perform the collision or 
streaming step of the LBM are highlighted with a wavy underline and functions that 
involve communication are underlined with a solid line. The algorithm is called with 
L equal to zero. As a result, the simulation is advanced by one coarse time step {L 
equal to zero corresponds to the coarsest level). Function Explosion performs coarse- 
to-fine communication. In its most basic implementation, PDF values from coarse 
grid cells are homogeneously distributed to fine cells according to 
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where are all the cell centers on the fine grid that correspond to x,; on the coarse 
grid. Function Coalescence performs fine-to-coarse communication. Here, PDF val¬ 
ues stored in the fine grid are transferred to the coarse grid according to 


n 

ct,fine (xj , t) , 

J=r 



(4.2) 


a,coarse 


where n is the number of fine cells that correspond to one coarse cell (n = 4 in 
2D and n = 8 in 3D). Function Communication takes care of transferring PDF 
data between grids on the same level. Boundary conditions are executed as a pre¬ 
streaming operation (cf. section 2.2) and are realized at the beginning of function 
Streaming Step. Both the streaming and collision step are only performed in the 
interior part of the grid, not in ghost layers, with one exception: For blocks with 
at least one larger neighbor block with a coarser grid, the treatment of boundary 
conditions and streaming also include the two innermost ghost layers. 

The actual implementation includes further optimizations as outlined in Algo¬ 
rithm 2. On the finest grid level, the streaming and collision step (cf. lines 10 and 17 
of Algorithm 1) can be combined into one fused stream-collide operation (cf. line 17 
of Algorithm 2). Combining the streaming and collision step significantly reduces the 
amount of data that is transferred between CPU and main memory. Not combining 
both steps requires to load and store the data from and to memory twice. Fusing both 
steps allows, for each cell, to read PDF values from neighboring cells (= streaming), 
perform the collision step, and store the result back to memory [28]. Since most time 
steps are executed on the finest level, the majority of the workload is generated by the 
blocks on this level. Thus, improving the performance of the algorithm on the finest 
level is essential for achieving good overall performance. Function StreamCollide also 
includes the treatment of boundary conditions as a pre-streaming step. 

When hybrid parallel execution with OpenMP is used, functions CollisionStep, 
Streaming Step, and StreamCollide, which are always called on an entire block, fol¬ 
low a NUMA-aware data parallelism model, meaning the work that must be performed 
for every cell is distributed evenly among all available threads. Employing a data par¬ 
allelism model for the compute kernels for the LBM when using a block-based domain 
partitioning is also suggested in [32, 36]. 

In general, the aforementioned three functions may include any performance op¬ 
timizations as long as the employed optimization strategies only affect the current 
collision, streaming, or fused stream-collide step. For our simulations, we typically 
rely on compute kernels for the LBM that make use of a “structure of arrays” data 
layout, loop splitting to reduce the number of concurrent store streams, and SIMD 
processing [28]. Optimization techniques that do not allow separate collision and 
streaming steps like for example the AA pattern [4], which requires an adapted col¬ 
lision followed by a fused stream-collide-stream operation, are not straightforwardly 
compatible with the presented grid refinement scheme. As a consequence, we use a 
two-grid approach for storing the PDF data and performing the streaming step. 

As one further step of optimization, communication is executed asynchronously. 
In particular, this permits to overlap computation and communication. To this end, 
every communication step is separated into two phases. During the first phase, all 
message data is packed into a buffer and the communication is initiated by calling a 
non-blocking, asynchronous send function of MPI. Simultaneously, all the matching 
MPI receive functions are scheduled. In the second phase, the communication is final¬ 
ized by waiting for the MPI receive functions to complete. The actual transfer of data 
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Algorithm 2: NonUniformTimeStep (optimized version) 


1 

2 

3 

4 

5 

6 

7 

8 
g 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

32 

33 

34 

35 

36 

37 


Function NonUniformTimeStep(^e?;e/ L) 

forall the blocks on level L do CollisionStep 
if L ^ coarsest level then 
I call InitiateExplosion (L, L — 1) 
end if 

call InitiateCommunication (L) 
if L 7 ^ finest level then 

recursively call NonUniformTimeStepd/ + 1) /* recursive call */ 

call InitiateCoalescence (L, L + 1) 
end if 

if L 7 ^ coarsest level then 

call FinishExplosion (L, L — 1) 

call ExplosionlnterpolationCD /* interpolation of exploded values */ 

end if 

call FinishCommunication (L) 
if L = finest level and L 7 ^ coarsest level then 
// fused LBM streaming & collision step 
forall the blocks on level L do StreamColld^^ 

else 

forall the blocks on level L do StreamingStep 
if L 7 ^ finest level then 
I call FinishCoalescence (L, L + 1) 
end if 

if L = coarsest level then 
I return 
end if 

forall the blocks on level L do CollisionStep 
end if 

call InitiateCommunication (L) 
if L 7 ^ finest level then 

recursively call NonUniformTimeStep(Z/ + 1) /* recursive call */ 

call InitiateCoalescence (L, L + 1) 
end if 

call FinishCommunication (L) 
forall the blocks on level L do StreamingStep 
if L 7 ^ finest level then 
I call FinishCoalescence (L, L + 1) 
end if 


38 end 


is executed between these two phases. As a result, when reaching the second phase, 
the data that has already arrived can be unpacked and computation can continue 
immediately without stalling. The first communication between blocks on the same 
grid level, for example, is initiated in line 6 of Algorithm 2. The matching completion 
of this communication operation happens in line 15. While the corresponding data is 
transferred between processes, all finer blocks are processed as a result of the recursive 
call in line 8, coalescence is initiated, and explosion is finished. 

Finally, the homogeneous distribution of PDF values from one coarse cell to multi¬ 
ple finer cells (cf. (4.1)) is improved by a mass and momentum conserving interpolation 
scheme presented in [13]. After all necessary PDF values were sent from the coarse 
grid, this interpolation is carried out by function Explosioninterpolation (cf. lines 12 
and 13 in Algorithm 2). 

The algorithm proposed here allows grid level transitions to be present at fluid- 
obstacle interfaces if these interfaces are treated with simple bounce back (no slip) 
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or symmetric (free slip) boundary conditions. To this end, the treatment of bound¬ 
ary conditions is extended to the two innermost ghost layers on fine blocks at the 
interface to coarse neighbors. Furthermore, cells marked as fluid or obstacle cells 
must be marked consistently across the interface region where coarse and fine cells 
overlap. If, at the interface between different grid levels, a coarse cell is marked as 
being a fluid/obstacle cell, all eight corresponding fine cells must also be marked as 
fluid/obstacle cells. 

Even though the asynchronous communication scheme helps to improve perfor¬ 
mance, it is still crucial for parallel performance to implement all communication 
patterns with as little data transfer as possible. In the following subsections, we will 
describe how the parallel communication is optimized in the three possible situations, 
i.e., when communication occurs between blocks on the same level, when data is 
transferred from coarse to fine grids, and vice versa. 

4.2. Communication between blocks on the same level. For the commu¬ 
nication between blocks on the same level, we first mnst check whether or not there 
exists a larger block with a coarser grid in direct proximity of the two blocks that 
need to communicate. We refer to a block C as being in direct proximity of two 
blocks A and B if there exists a non-empty intersection, A fl i? fl C 7 ^ 0, of all three 
blocks. For the remainder of this section, we assume that A and B are adjacent blocks 
on the same level that need to communicate. Our distributed data structure stores 
information about process-local blocks and all blocks in the immediate neighborhood 
of these local blocks (cf. section 3.2). As a result, checking the neighborhood of a 
block for the existence of a larger block with a coarser grid is a strictly process-local 
operation. 

If no larger blocks are detected, i.e., if all blocks C with A fl B fl C 7 ^ 0 are either 
smaller blocks with a finer grid or blocks on the same level as A and B, then the same 
communication scheme as used for simulations without grid refinement is applied. 
Both blocks exchange one ghost layer and for each cell only those PDF values that 
stream into the neighboring block are communicated (cf. section 3.4 and Figure 6 ). 

If, however, there exists a larger block C with A D B D C ^ $ that contains a 
coarser grid than blocks A and B, then the two innermost ghost layers of A and B 
must be included in the streaming step and in the treatment of boundary conditions. 
Additionally, we must guarantee that after two consecutive streaming steps on A and 
B, all PDF values in these two ghost layers that are about to be sent to the coarse 
neighbor are still valid. Thus, the communication with a neighbor on the same level 
must be adapted in direct proximity of a larger neighbor block C that contains a 
coarser grid. If PDF data is transferred across a corner or an edge, now two layers of 
cells are communicated, and for each cell, all PDF values are sent. For communication 
across a face, however, two layers of cells are only used along the edges of the face. On 
the inside of the face, only one layer is used. For these interior cells, only those PDF 
values that are about to stream into the neighboring block are sent. This approach is 
illustrated in Figure 9 for 2-dimensional simulations (note that faces in 3D correspond 
to edges in 2D). 

In summary, communication between blocks on the same level is, in general, 
quite similar to the well optimized communication scheme used in simulations without 
refinement. Only in direct proximity of larger neighbor blocks with coarser grids, 
slightly more data must be exchanged. Since, for 3-dimensional simulations, the 
amount of data exchanged between blocks is dominated by communication across 
faces, ensuring that these transfers operate with the best possible efficiency is crucial 
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Fig. 9. In direct proximity of a larger neighbor block with a coarser grid, for simulations in 
2D, blocks on the same level exchange two layers of cells when communicating across a corner 
(3D: corners and edges). When communicating across an edge (3D: face), two layers of cells are 
only transferred on the comer of the edge. On the inside of an edge, only one layer is used and only 
those PDF values that are about to stream into the neighboring block are sent (highlighted in blue). 



Fig. 10. During coarse-to-fine communication, in order to also have valid data in all corners 
and edges of neighboring fine blocks (cells highlighted in gray), certain coarse cells must be sent to 
multiple fine neighbors. Thus, parts of the coarse block that must be transferred will overlap. 


for high performance. In conclusion, communication between blocks on the same level 
never requires to send more than two layers of cells. In fact, almost always only one 
layer of cells that only contains those PDF values streaming into the neighboring 
block is sent. 

4.3. Coarse-to-fine communication. During coarse-to-fine communication, 
coarse blocks send corner values and the appropriate parts of edges and faces to their 
finer neighbor blocks. Some coarse cells must be sent to multiple fine neighbors. Thus, 
parts of the coarse block that must be transferred will overlap (see Figure 10). For 
coarse-to-fine communication, the content of two cell layers is sent from the coarse grid 
to the fine grid (see Figure II). On the fine grid, these PDF values are distributed 
to all four ghost layers. As mentioned in section 4.1, the implementation of this 
distribution uses a mass and momentum conserving interpolation scheme from [13]. 

During coarse-to-fine communication, all PDF values are transferred from the 
coarse to the fine grid. Since the treatment of boundary conditions is also performed 
on the two innermost ghost layers of the fine grid, no PDF values can be omitted. 
If only those PDF values are communicated that stream into the fine block, having 
a simple no slip boundary condition for an obstacle that spans across a grid level 
transition would lead to an inconsistent state. However, if grid level transitions only 
occur within regions that entirely consist of fluid cells, sending only one coarse cell 
instead of two and sending only those PDF values that stream into the fine block is 
sufficient for accurately performing coarse-to-fine communication. In careful parallel 
performance benchmarking, we did not observe any significant differences in parallel 
performance since the transfer of data is mostly hidden by the asynchronous com- 
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Fig. 11. For coarse-to-fine communication, two layers of cells are sent from the interior part 
of the coarse grid to the ghost layers of the fine grid. This communication includes all PDF values 
stored in these cells. 
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muiiication and since for most simulations, the communication time is dominated by 
communication between blocks on the same level. This is because the bulk of com¬ 
munication volume is equal-level communication. Furthermore, since most time steps 
are performed on the finest grid, the majority of communication takes place between 
blocks on the finest level. 

4.4. Fine-to-coarse communication. Fine-to-coarse communication is a post¬ 
streaming communication step. In contrast to coarse-to-fine communication and com¬ 
munication between blocks on the same level, here, data is read from ghost layers and 
written into the interior part of the grid (cf. steps (6) and (7) in Figure 8). The coarse 
grid needs information from the two innermost ghost layers of the fine grid. In order 
to reduce the amount of data that is communicated, all the fine cells that correspond 
to one coarse cell are already aggregated (cf. (4.2)) while packing the data on the 
sending side. On the receiving side, i.e., on the coarse grid, these aggregated PDF 
values are then copied to the appropriate part of the grid. 

The data of some fine cells, however, never needs to be transferred to the coarse 
grid because of another neighboring fine block whose communication already includes 
this data (see Figure 12). The following rule applies: If a fine block is connected 
to a coarse block via a face, these two blocks never communicate across edges or 
corners. Analogously, if a fine and a coarse block are connected via an edge, there 
is no communication across corners for these two blocks. As a result, all connections 
of image (2) in Figure 7 that are not available in image (3) do not perform any 
communication. In our parallel implementation of the LBM on non-uniform grids, 
coarse-to-fine as well as fine-to-coarse communication is only performed on connections 
illustrated in image (3) of Figure 7. 

Also, we must never send all PDF values stored in the fine cells, but only those 
streaming into the coarse block (see Figure 13). All other PDF values originate from 
the interior part of the coarse block and must not be overwritten. Also note that 
as a result of our parallelization scheme, some of the PDF values that arrive at the 
coarse block are sent redundantly from different fine source blocks (highlighted in red 
in Figure 13). 

4.5. Load balancing. An even distribution of the workload to all available 
processes is crucial for good scalability and high performance. The goal is to prevent 
processes from being idle while waiting for data from other processes. For LBM- 
based simulations, with and without grid refinement, each block is assigned the same 
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Fig. 12. If PDF values are sent from fine to coarse blocks^ they must be mapped to the right 
part of the coarse block. Cells highlighted in gray never have to be sent. In contrast to coarse-to-fine 
communication and communication between blocks on the same level, here, PDF values are sent 
from ghost layers of fine blocks to the interior part of coarse blocks. 
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Fig. 13. During fine-to-coarse communication, PDF values that originate from the interior 
part of the coarse block must not be communicated. Only PDF values that stream into the coarse 
block (highlighted in red and blue) are sent from ghost layers of fine neighbors. Some PDF values 
arrive multiple times from different fine source blocks (highlighted in red). 


number of lattice cells (cf. section 3.2 and Figure 3). Distributing these blocks to all 
available processes such that no process idles during execution is the central task of 
our load balancing algorithms. 

A naive load balancing scheme might look like as follows: Each block is assigned 
a certain weight. In LBM-based simulations, this weight must correspond to the 
block’s refinement level. Since for blocks on a fine level twice as many time steps are 
executed as for blocks on the next coarser level, the weight of a fine block must be 
twice the weight of an equivalent block on the next coarser level. As a result, just like 
the number of time steps, the weight of a block grows exponentially with its level. 
Distributing all blocks according to their weights is a perfectly suitable load balancing 
strategy for LBM-based simulations without grid refinement. 

Even if all blocks reside on the same level, their weights may vary since blocks 
that only consist of fluid cells typically generate more work than blocks that contain 
many cells covered by obstacles which can be skipped by most parts of the algorithm. 
Consequently, for plain LBM-based simulations, setting the block weights to a value 
that is proportional to the number of fluid cells proves to be a good choice. Since 
the weight of each block is calculated by a user-defined callback function, also more 
complex schemes that also take into account the work generated by non-fluid cells as 
proposed by [19] for a simulation of parts of the human respiratory system can be 
employed. Using the approach of relating the weight of a block only to the number of 
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fluid cells, high performance and excellent scalability for a simulation of the human 
coronary artery tree on current supercomputers has been demonstrated in [24]. This 
approach works well for LBM-based simulations without grid refinement since for each 
time step, all blocks can be processed completely independently. Synchronization is 
only necessary either at the beginning or at the end of each time step. 

However, perfectly distributing blocks according to their weight is not feasible 
anymore for a parallel LBM on non-uniform grids. Here, blocks that reside on differ¬ 
ent levels cannot be processed completely independently. At fixed stages during the 
algorithm, blocks of different levels must interact with each other via fine-to-coarse 
and coarse-to-fine communication. For best performance, a suitable load balancing 
scheme must take into account the structure of the algorithm including all points of 
communication. 

The load balancing scheme that best fits our algorithm processes all levels sepa¬ 
rately. For each level, all blocks that reside on this level are distributed to all available 
processes. Typically, most of the work is generated by the finest grid levels. As a 
consequence, perfectly distributing the work generated by these levels is crucial for 
maximal performance. In principle, many load balancing algorithms are suitable for 
this kind of problem and even a different load balancing algorithm might be used for 
each level. In practice, we do not vary the algorithm from level to level. Currently, we 
use algorithms that are either based on space-filling curves using Morton or Hilbert 
order or that make use of the graph partitioning library METIS [35] (see section 3.3). 
Even though blocks on different levels are distributed separately, blocks on the same 
level can still have different weights due to different fluid to obstacle cell ratios. The 
resulting different weights are taken into account during distribution of these blocks 
to all available processes. As a result, process idle times are minimized since our 
level-wise load balancing scheme perfectly fits the structure of the parallel algorithm 
for the LBM on non-uniform grids. The need for level-wise load balancing was also 
pointed out in [29], but has not yet been implemented. 

5. Validation and benchmarks. In the following subsections, we present two 
validation scenarios that verify the accuracy and correctness of our parallelization ap¬ 
proach for the algorithm proposed by [50]. Additionally, we evaluate the performance 
of our implementation in order to confirm that our data structures and algorithms are 
fully capable of efficiently performing massively parallel simulations. For an extensive 
discussion on the general accuracy of the method, we refer to the validation in [50]. 
We compute two different Poiseuille flow scenarios in order to verify the correctness 
of our parallel implementation. In contrast to [50], we use the TRT model instead 
of the SRT model in all our simulations. Concerning the two relaxation parameters 
of the TRT model, we propose to keep Aeo (see (2.7)) constant on all levels, scale 
Ae according to w in (2.6a) and (2.6b), and calculate Ao from Aeo and Ae according 
to (2.8) (see section 2.3). 

In order to enable comparability of different simulations, velocities are scaled dur¬ 
ing evaluation such that the maximum velocity given by the corresponding analytical 
solution is always equal to 1. For all simulations. Ago is set to %6, the Reynolds num¬ 
ber is fixed to 10, and the simulation space spans [0,1] x [0,1] x [0,1], which is equal to 
the fluid domain except for the pipe Poiseuille flow scenario where only a cylindrical 
part of the simulation space is covered by the fluid domain. As a consequence, the 
surface area of any cut perpendicular to the x-axis is either 1 for simulations with a 
rectangular or 1 / 4 - tt for simulations with a circular cross section (cf. Figure 14). The 


PARALLEL LATTICE BOLTZMANN ON NON-UNIFORM GRIDS 


21 



Fig. 14. Domain partitioning and flow profiles in the zy-plane for simulations of plane (left) 
and pipe (right) Poiseuille flow with grid refinement and four different grid levels. Only the block 
partitioning is shown. During simulation, each block consists of 10 x 10 x 10 cells. In the flow 
profile, high velocities are depicted in red, zero velocities in dark blue. 


volumetric flow rate Q is calculated as 

Q = u - A, 

with u denoting the mean flow velocity through the cross-sectional area A. In our 
volumetric approach for the LBM, flow velocities are evaluated at cell centers. We 
calculate weighted discrete and L°° norms of the error in the velocity by 

iterating all fluid cells in the domain and 


L^ = 'ffwi-Aui , , L°° = maxAui, 

i y i 

where Wi = is a grid level-dependent weighting factor that is equal to the 

volume of cell i and Aui = ||ui — Ui^exactW-, with denoting the flow velocity com¬ 
puted by the simulation and Ui exact denoting the corresponding velocity given by the 
analytical solution. 

Since entire cells are marked as either fluid or obstacle and we employ bounce back 
boundary conditions, boundaries are always located halfway in-between two cells. All 
simulations are run in 3D and make use of the D3Q19 lattice model. 

5.1. Poiseuille flow. For the simulation of Poiseuille flow, we drive the fluid 
with a body force caused by a constant acceleration (cf. section 2.1) that induces a 
flow in x-direction. The acceleration is scaled to different grid levels using (2.9). All 
walls are fixed and set to no slip boundary conditions. We expect a parabolic flow 
profile that is in accordance with Poiseuille’s law. 

We use two different setups (cf. illustration in Figure 14). By having no slip 
bounce back boundary conditions at y = 0 and y = 1 and periodic boundaries in x- 
as well as z-direction, we simulate plane Poiseuille flow. For pipe Poiseuille flow, we 
use a circular profile in the yz-plane and choose periodic boundaries in x-direction. 

For plane Poiseuille flow, the flow rate Q = u-A = ^Is-Umax- A = ^fs-Umax- Using 
the TRT model and Ago equal to ?46, we expect the simulation to be accurate on any 
uniform grid. Without grid refinement, the simulation converges to the analytical 
solution, i.e., the volumetric flow rate as well as the flow velocities evaluated at all 
cell centers match the analytical solution once the steady state is reached. If we add 
grid refinement, we can retain this convergence behavior to the steady state solution 
over time. When refining only at the plate on the bottom {y = 0), or at the plate on 
the top {y = 1), or in the middle between both plates, or at the bottom as well as 
at the top plate, convergence over time remains unchanged and we see convergence 




































Fig. 15. Velocity profiles evaluated along a line through the middle of the channel, parallel to 
the y-axia. The black line follows the analytical solution, the blue circles correspond to the velocities 
calculated by the simulation (only every 2nd data point is plotted). The graphs correspond to the 
two simulations depicted in Figure If. 




Fig. 16. L°° evaluated over the time of the simulation (simulation time in time steps on the 
finest grid). The black line corresponds to a simulation with the entire domain refined to the finest 
resolution. The blue circles correspond to a simulation with grid refinement with four different grid 
resolutions and the finest level only at fluid-boundary interfaces (cf. illustration of Figure If). We 
see the same agreement when evaluating and norms. 


to the analytical solution, i.e., L°° will drop to values close to machine accuracy (cf. 
first graph in Figures 15 and 16). 

For pipe Poiseuille flow with a circular cross section, the volumetric flow rate 
Q = u ■ A = 'kj 2 ■ Umax • A = i/s • Umax ' For the LBM with no slip conditions at 
curved boundaries, the expected convergence order is discussed in, e.g., [23]. Simi¬ 
larly, the accuracy at the refinement interface is studied in [50]. For this article, the 
circular profile is approximated by a staircase and we use the well-established mass- 
conserving first order bounce back condition, and thus expect errors that converge 
linearly with the resolution of the simulation. When uniformly refining the global grid, 
we observe the expected behavior. Errors decrease linearly the higher the resolution 
of the underlying grid, i.e., the more cells we use for modeling the flow inside the 
pipe (cf. “global refinement” in Table 1). Grid refinement can help to avoid resolving 
the entire domain with a finer grid. If we use finer grids only near the wall of the 
pipe (see second illustration in Figure 14), convergence over time remains unchanged 
(cf. second graph in Figure 16) and we obtain comparable accuracy with simulations 
that uniformly resolve the entire pipe with fine grids (see Figure 17). Refining only 
the center of the pipe has no impact on accuracy. In conclusion, the accuracy of 
the simulation is mainly determined by the resolution of the grid used to resolve the 
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Fig. 17. The error ||u — Uercact || of simulations of Poiseuille flow through a channel with circular 
cross section evaluated along a line through the middle of the channel, parallel to the y-axis. The 
first graph shows the error for simulations where the entire domain is refined either to level 0, 1, 2, 
or 3. The second graph shows the error for simulations where the finest level (1, 2, or 3) is only used 
to resolve the fluid-wall interface. Towards the center of the channel, the grid resolution becomes 
increasingly coarser, always reaching level 0 at the center (cf. third illustration in Figure If.). Small 
jumps in the velocity that are noticeable in the graph on the right are due to errors at the refinement 
interface and conform with the observations reported in [50]. 

Table 1 

Comparison of the accuracy of different simulations of pipe Poiseuille flow. The simulation 
was executed with different resolutions and with refining the entire domain to a certain level (global 
refinement), with grid refinement to different levels only close to the wall of the pipe (wall refine¬ 
ment), and with grid refinement to different levels only in the center of the pipe (center refinement). 
The corresponding refinement level is stated in brackets. On the coarsest level, the pipe consists of 
60 cells in diameter - leading to 120, 240, and fSO cells for level 1, 2, and 3, respectively. The 
linear decrease of the error conforms with the first order bounce back boundary conditions in use. 



fiow rate error 

Ri 


L°° 

global refinement (0) 
global refinement (1) 
global refinement (2) 
global refinement (3) 

7.96 X 10"3 
4.98 X 10-3 
1.86 X 10-3 
1.05 X 10-3 

3.48 X 10-3 
2.06 X 10-3 
0.77 X 10-3 
0.42 X 10-3 

4.28 X 10-3 
2.43 X 10-3 
0.90 X 10-3 
0.50 X 10-3 

19.2 X 10-3 
8.92 X 10-3 
4.59 X 10-3 
2.87 X 10-3 

wall refinement (1) 
wall refinement (2) 
wall refinement (3) 

5.14 X 10-3 
1.96 X 10-3 
1.13 X 10-3 

2.17 X 10-3 
0.84 X 10-3 
0.49 X 10-3 

2.54 X 10-3 
0.98 X 10-3 
0.57 X 10-3 

8.92 X 10-3 
4.59 X 10-3 
2.87 X 10-3 

center refinement (1) 
center refinement (2) 
center refinement (3) 

7.95 X 10-3 
7.92 X 10-3 
7.92 X 10-3 

3.46 X 10-3 
3.40 X 10-3 
3.39 X 10-3 

4.27 X 10-3 
4.21 X 10-3 
4.20 X 10-3 

19.2 X 10-3 
19.2 X 10-3 
19.2 X 10-3 


fluid-wall interface at the inside wall of the pipe. 

Results for all of these simulations of pipe Poiseuille flow are summarized in 
Table 1. Accuracy is evaluated after convergence of the flow to a steady state. For 
calculating the flow rate Q, we evaluate the flow through a cross sectional area in the 
middle of the channel parallel to the yz-plane. The relative error compared to the 
analytical solution of Q is shown in the second column of Table 1. and 

norms are listed in columns 3, 4, and 5. 

5.2, Performance evaluation. In order to evaluate the performance of mas¬ 
sively parallel simulations, we run benchmarks on two petascale supercomputers: 
JUQUEEN, an IBM Blue Gene/Q system ranked 11th in the TOP500 list^, and 


^November 2015 
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Table 2 

The memory requirements and workload of each grid level as well as the amount of space 
covered by every level in the lid-driven cavity performance benchmark. The memory requirements 
are proportional to the number of blocks (the finest level accounts for 59.81 % of all blocks). 



L = 0 

L = 1 

L = 2 

L = 3 

domain coverage ratio 

77.78 % 

16.67% 

4.17% 

1.39% 

workload share 

1.10% 

3.76% 

15.02 % 

80.13% 

memory/block share 

6.54% 

11.22% 

22.43 % 

59.81 % 


SuperMUC, a x86-based system build on Intel Xeon CPUs ranked 23rd in the same 
list. JUQUEEN provides 458,752 PowerPC A2 processor cores running at 1.6 GHz, 
with each core capable of 4-way multithreading. Based on our observations in [24], 
in order to achieve maximal performance, we make full use of multithreading on 
JUQUEEN by placing either four processes or four threads of the same process on 
one core. The SuperMUC system features fewer, but more powerful, cores than 
JUQUEEN. It is built out of 18,432 Intel Xeon E5-2680 processors running at 2.7 GHz, 
which sums up to a total of 147,456 cores. Due to the installation of a hardware up¬ 
grade^, the simulations for this article are limited to 4,096 CPUs (32,768 cores). On 
both systems, we use double-precision floating-point arithmetic and the highest com¬ 
piler optimization level available: level 5 for the IBM XL compiler on JUQUEEN and 
level 3 for the Intel compiler on SuperMUC. 

Eor evaluating weak and strong scaling performance of our parallel algorithm for 
the LBM on non-uniform grids, we make use of a synthetic benchmark that simulates 
lid-driven cavity flow in 3D with D = [0,1] x [0,1] x [0,1]. The benchmark uses a 
velocity bounce back boundary condition at z = 1 in order to simulate the moving lid. 
Eor all other domain boundaries, we use no slip bounce back boundary conditions. 
The regions where the moving lid meets the stationary domain boundaries are refined 
three times. The properties of the resulting refinement structure are listed in Table 2; 
they remain fixed for all performance benchmarks. 

Eor the evaluation of weak scaling performance, we assign four blocks of the 
finest level to every process. As a result, each process additionally holds one or two 
blocks of level 2, one or no block of level I, and one or no block of the coarsest 
level. On average, each process will hold 6.6875 blocks, independent of the total 
number of processes. When increasing the number of processes, the number of cells 
per block (and process) remains constant. As a consequence, the global number of 
cells increases linearly with the number of processes. For the evaluation of strong 
scaling performance, only one block of the finest level is assigned to each process. As 
a result, each process additionally holds either one or no block from the other levels. 
On average, each process will hold 1.6719 blocks. The more processes are used, the 
fewer cells are allocated per block (and process). As a consequence, the global number 
of cells remains constant, independent of the total number of processes. 

We report number of processor cores, not the number of processes. For a fixed 
number of cores, we can either run the benchmark with a processes (MPI only) or 
run a hybrid simulation with j processes and (3 OpenMP threads per process. On 
SuperMUC, we choose a to be equal to the number of cores. Since we make full use 
of multithreading on JUQUEEN, we choose a to be equal to four times the number 
of cores when running the benchmark on JUQUEEN. We compare the performance 


^January 2015 until summer 2015 
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Fig. 18. IFeafc (left) and strong (right) scaling performance on SuperMUC. For evaluating 
weak scaling performance, the benchmark uses 3.34 million cells/core. For evaluating strong scaling 
performance, the global number of cells is fixed to 8.56 million. Only the best results of three different 
parallelization strategies (MPI only, MPI+OpenMP with 4 threads/process, MPI+OpenMP with 8 
threads/process) are plotted. Detailed results are listed in Table 3. 


of simulations that rely on MPI only to the performance of hybrid simulations that 
use four (/? = 4) or eight (/3 = 8) threads, respectively. Since hybrid simulations use 
/? times fewer processes, we allocate /3 times more cells for each block (and process). 
As a result, for a given benchmark scenario and for a hxed number of processor 
cores (= fixed amount of hardware resources), the amount of work remains constant, 
independent of the parallelization strategy in use. 

For the weak scaling benchmark, we report MLUPS/sec, which stands for “mil¬ 
lion lattice cell updates” per second. For ideal weak scaling, MLUPS/sec must scale 
linearly with the number of processor cores. For the strong scaling benchmark, we 
report the number of time steps executed each second on the finest grid. The more 
time steps are executed each second, the higher the throughput of the simulation and 
the shorter the time to solution for a hxed problem size. For both benchmarks, we 
also report MLUPS/sec divided by the number of processor cores. When increas¬ 
ing the number of cores, MLUPS/sec-core is proportional to parallel efficiency. If 
MLUPS/sec-core remains constant, parallel efficiency is equal to I. For all measure¬ 
ments, we use the same compute kernels as described in [24]. These compute kernels 
make use of SIMD instructions on the IBM Blue Gene/Q architecture of JUQUEEN 
as well as the x86-based hardware of SuperMUC. Since, in waLBerla, the perfor¬ 
mance of the fastest TRT kernel is identical to the performance of the fastest SRT 
kernel [24], all results reported here apply for both collision schemes, TRT and SRT. 

For evaluating weak scaling performance on SuperMUC, we use 3.34 million cells 
per core. As a consequence, with 32,768 cores of SuperMUC, we can execute simula¬ 
tions that consist of up to 110 billion cells. For these simulations, we achieve up to 
154,599 MLUPS/sec. Since the algorithms do not involve global communication and 
since our underlying data structures are fully distributed to all processes, we observe 
almost perfect weak scaling (see hrst graph in Figure 18). Hybrid simulations using 
four threads per process are slightly faster than simulations that rely on MPI only 
and simulations that use eight threads per process (see Table 3). For the evaluation of 
strong scaling performance, we fix the global number of cells to 8.56 million and run 
the same simulation on 512 up to 32,768 cores. In the largest simulation with 32,768 
cores, each core is responsible for only 261 cells. As expected, parallel efficiency de¬ 
creases when more processes are used and when fewer cells are assigned to each core. 
However, throughput constantly increases up to 1,114 time steps per second on the 
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Table 3 

Detailed results for weak scaling (3.34 million cells/core) and strong scaling (8.56 million cells 
in total) performance on SuperMUC. The table lists results for simulations that rely on MPI only 
and for hybrid simulations with four and eight threads per process. 


cores 

weak scaling (MLUPS / 

sec-core) 

strong scaling (time steps / sec) 

MPI only 

OpenMP (4) 

OpenMP (8) 

MPI only 

OpenMP (4) 

OpenMP (8) 

128 

4.80022 

4.85441 

4.75645 




256 

4.77671 

4.89659 

4.64531 




512 

4.76295 

4.83518 

4.63247 

127.04 

155.34 

145.64 

1024 

4.75190 

4.80719 

4.64152 

191.40 

263.06 

248.78 

2048 

4.72467 

4.78511 

4.63214 

289.64 

422.42 

436.48 

4096 

4.68737 

4.76766 

4.63984 

412.57 

650.54 

667.65 

8192 

4.71740 

4.75104 

4.60749 

593.36 

853.99 

775.05 

16384 

4.64417 

4.72311 

4.60263 

799.10 

991.72 

822.86 

32768 

4.62946 

4.71798 

4.57990 

1057.57 

1113.97 

1018.27 


finest grid (see second graph in Figure 18), enabling simulations that perform up to 
4 million time steps per hour compute time. Hybrid simulations are generally faster 
(cf. Table 3) since for hybrid simulations fewer processes with larger blocks per pro¬ 
cess can be used. As a result, the amount of data communicated between neighboring 
processes is smaller compared to simulations that use MPI parallelization only. 

On JUQUEEN, we observe qualitatively the same behavior. However, in contrast 
to SuperMUC, JUQUEEN provides many more processor cores. The simulations that 
run on the entire machine make use of 1.835 million threads. Since these large simu¬ 
lations require data structures that consist of millions of blocks, we make use of our 
two-stage initialization process and load the results of the initialization phase from file 
instead of creating the domain partitioning and performing load balancing at runtime 
(see section 3.3). For the evaluation of weak scaling performance, we use 1.93 million 
cells per core, resulting in almost one trillion cells (886 billion) for the largest simu¬ 
lation. Similar to our results on SuperMUC, we observe almost perfect weak scaling 
(see first graph in Figure 19), with hybrid simulations using four threads per process 
being significantly faster than simulations that rely on MPI only and simulations that 
use eight threads per process (see Table 4). With 889,602 MLUPS/sec for the largest 
simulation on 458,752 cores, we are updating 16.9 trillion PDFs in each second. For 
the evaluation of strong scaling performance, we fix the global number of cells to 233 
million and run the same simulation on 512 up to 458,752 cores. In simulations on 
458,752 cores, each core is responsible for only 508 cells. Again, we observe parallel 
efficiency decreasing when more processes are used and when fewer cells are assigned 
to each core. Throughput, however, constantly increases up to 98 time steps per sec¬ 
ond on the finest grid (see second graph in Figure 19). In order to achieve maximal 
throughput, we must use a hybrid parallelization approach (cf. Table 4). 

Compared to benchmarks for simulations on uniform grids [24], our non-uniform 
LBM scheme that uses statically refined grid structures looses a factor of 2 to 2.5 
in terms of MLUPS/sec numbers. This is due to the fact that in the uniform LBM 
scheme, there is significantly less overhead. The uniform scheme does not need to 
perform interpolation, it communicates less data, and it can always make use of fused 
stream-collide compute kernels. In [24], strong scaling performance is only reported 
for a simulation of blood flow in the coronary artery tree. In this kind of simulation, 
blocks are less densely populated with fluid cells, resulting in less data that needs to be 
processed and communicated. Throughput in this type of simulation is typically two 
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Fig. 19. lUeaA; (left) and strong (right) scaling performance on JUQUEEN. For evaluating 
weak scaling performance, the benchmark uses 1.93 million cells/core. For evaluating strong scaling 
performance, the global number of cells is fixed to 233 million. Only the best results of three different 
parallelization strategies (MPI only, MPI+OpenMP with 4 threads/process, MPI+OpenMP with 8 
threads/process) are plotted. Detailed results are listed in Table 4- 

Table 4 

Detailed results for weak scaling (1.93 million cells/core) and strong scaling (233 million cells 
in total) performance on JUQUEEN. The table lists results for simulations that rely on MPI only 
and for hybrid simulations with four and eight threads per process. 


cores 

weak scaling (MLUPS / sec-core) 

strong scaling (time steps / sec) 

MPI only 

OpenMP (4) 

OpenMP (8) 

MPI only 

OpenMP (4) 

OpenMP (8) 

32 

1.72690 

2.06960 

1.70424 




64 

1.72024 

2.05401 

1.75702 




128 

1.70207 

2.04498 

1.67617 




256 

1.70720 

2.00083 

1.71336 




512 

1.69481 

1.98645 

1.70756 

4.083 

4.937 

4.942 

1024 

1.70529 

1.98478 

1.67012 

6.804 

7.961 

8.641 

2048 

1.67337 

1.97916 

1.67471 

10.464 

12.460 

13.280 

4096 

1.66438 

1.97675 

1.67879 

15.983 

21.644 

23.128 

8192 

1.65579 

1.97816 

1.67131 

20.527 

30.990 

30.129 

16384 

1.62633 

1.95565 

1.65981 

26.633 

45.913 

46.443 

32768 

1.62369 

1.94555 

1.67588 

34.335 

58.519 

65.057 

65536 

1.60442 

1.94437 

1.66758 

39.391 

70.840 

73.794 

131072 

1.58830 

1.93697 

1.65841 

46.231 

78.146 

81.712 

262144 

1.58418 

1.92665 

1.65170 

51.478 

81.019 

89.843 

458752 

— 

1.92446 

1.64429 

— 

89.376 

98.017 


to three times higher than in simulations with blocks that are densely populated with 
fluid cells. When comparing the number of time steps that can be executed in one 
second by our non-uniform LBM scheme to the throughput in uniform simulations, 
we observe a decrease by a factor of 3 to 4 for simulations with blocks that are densely 
populated with fluid cells. However, this loss in efficiency is more than compensated 
by the fact that, in general, non-uniform simulations based on locally refined grid 
structures generate significantly less workload and require much less memory than 
simulations that uniformly resolve the entire domain with the highest resolution. 
With more than 1,000 time steps per second, our benchmark with 4 different levels 
of refinement achieves considerably higher throughput rates on SuperMUC than on 
JUQUEEN. This is also in good agreement with previous observations in [24]. 

Finally, we turn to an application-oriented example to demonstrate the potential 
of the presented algorithms. Figure 20 illustrates a phantom geometry of the vocal 
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Fig. 20. Domain partitioning and flow profile of a preliminary study on the vocal fold and voice 
generation. Only the block partitioning is shown. The simulation uses five levels of refinement. The 
highest resolution is required only in the gap between the two folds (blocks highlighted in red). 


fold of a human as it is used to study the voice generation within the human throat [7]. 
Using grid refinement with five different levels of resolution, we can perform direct 
numerical simulations of flows with a Reynolds number of 1,000. Here, the domain is 
partitioned into 25,800 blocks with 16 x 16 x 16 cells per block, resulting in a total of 
105,676,800 cells. With 4,300 processes on SuperMUC, we achieve close to 100 time 
steps per second. Due to five levels of refinement, 55.2 times less memory is required 
and 98.2 times less workload is generated compared to the same simulation with the 
entire domain refined to the finest level. 

6. Conclusion and outlook. In this article, we have presented the main build¬ 
ing blocks that we have implemented into the software framework WALBerla that 
enable massively parallel simulations with the LBM on non-uniform grids. We have 
introduced the key concepts behind our distributed data structures including a forest 
of octrees-based domain partitioning into blocks that provides the foundation for ex¬ 
cellent scalability to current supercomputers. Using these data structures as a basis, 
we have discussed a distributed memory parallelization approach for the grid refine¬ 
ment technique developed in [50]. Furthermore, we have presented an asynchronous 
communication scheme that requires only minimal communication and a level-wise 
load balancing strategy that perfectly fits the structure of the algorithms involved. 
Additionally, we have proposed a method for scaling the two relaxation parameters 
of the two-relaxation-time collision model across different grid resolutions. 

We have verified the correctness of our parallel scheme for two Poiseuille flow 
scenarios. Furthermore, we have demonstrated not only near-perfect weak scalability 
on two current petascale supercomputers but also an absolute performance of close to 
a trillion cell updates per second for a simulation with almost two million concurrent 
threads and a total number of 886 billion cells. To our best knowledge, this is the 
largest computation to date with the LBM on non-uniform grids, significantly exceed¬ 
ing the data previously published [21, 30, 40, 51]. Strong scaling a simulation with 
several millions of cells, we have reached a performance of less than one millisecond 
per time step. A hybrid parallel execution model where multiple OpenMP threads 
are executed per MPl process proves to be essential for best performance, especially 
when strong scaling simulations to massively parallel systems. 

Future work will deal with the support of dynamic grid refinement and runtime 
adaptivity. To this end, the data structure managing the block partitioning must allow 
for splitting, merging, and exchanging blocks between different processes at runtime. 
As a result, dynamic load balancing strategies must be implemented. Since our data 
structure allows to run octree-based as well as general graph-based algorithms (see 
section 3.2), dynamic load balancing can be based on space filling curves following, 
e.g., ideas published in [12], or it can be based on diffusive algorithms discussed, e.g.. 
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in [41, 42], 

Reproducibility. All algorithms presented in this article are implemented in 
the waLBerla software framework. All the benchmarks used in section 5 were also 
added to the framework and are available as part of the software. Researchers are 
encouraged to contact the authors of the framework through http://walberla.net in 
order to get access to the code. 
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